Abstract

We present a novel approach for learning non- 
linear dynamic models, which leads to a new 
set of tools capable of solving problems that 
are otherwise difficult. We provide theory 
showing this new approach is consistent for 
models with long range structure, and apply 
the approach to motion capture and high- 
dimensional video data, yielding results su- 
perior to standard alternatives. 



1. Introduction 

The notion of hidden states appears in many nonsta- 
tionary models of the world such as Hidden Markov 
Models (HMMs), which have discrete states, and 
Kalman filters, which have continuous states. Figure 1
shows a general dynamic model with observation $x_t$
and unobserved hidden state $y_t$. The system is
characterized by a state transition probability
$P(y_{t+1} \mid y_t)$ and a state-to-observation
probability $P(x_t \mid y_t)$.

Figure 1. Dynamic model with observation vector $x_t$ and
hidden state vector $y_t$.

The method for predicting future events under such a
dynamic model is to maintain a posterior distribution
over the hidden state $y_{t+1}$, based on all observations
$X_{1:t} = \{x_1, \ldots, x_t\}$ up to time $t$. The
posterior can be updated using the formula:

$$P(y_{t+1} \mid X_{1:t}) \propto \int P(y_t \mid X_{1:t-1})\, P(x_t \mid y_t)\, P(y_{t+1} \mid y_t)\, dy_t. \tag{1}$$

The prediction of future events $x_{t+1}, \ldots, x_{t+k}$,
$k > 0$, conditioned on $X_{1:t}$ is through the posterior
over $y_{t+1}$:

$$P(x_{t+1}, \ldots, x_{t+k} \mid X_{1:t}) \propto P(y_{t+1} \mid X_{1:t})\, P(x_{t+1}, \ldots, x_{t+k} \mid y_{t+1}). \tag{2}$$
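As a concrete illustration of the posterior update in Eq. 1, the following minimal sketch assumes a discrete hidden state, so that marginalizing over $y_t$ becomes a finite sum. The function name `update_posterior` and the two-state numbers are hypothetical and chosen only for illustration.

```python
def update_posterior(prior, x_t, transition, emission):
    """One step of Eq. 1 for a discrete hidden state.

    prior[i]         : P(y_t = i | x_1, ..., x_{t-1})
    emission[i][x]   : P(x_t = x | y_t = i)
    transition[i][j] : P(y_{t+1} = j | y_t = i)
    Returns the list P(y_{t+1} = j | x_1, ..., x_t).
    """
    n = len(prior)
    # Weight each current state by how well it explains the new observation.
    weighted = [prior[i] * emission[i][x_t] for i in range(n)]
    # Push the weighted mass through the transition model (sum over y_t).
    unnormalized = [
        sum(weighted[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Eq. 1 holds only up to proportionality, so normalize explicitly.
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical two-state example; all numbers are illustrative.
prior = [0.5, 0.5]
emission = [[0.75, 0.25], [0.25, 0.75]]
transition = [[0.9, 0.1], [0.2, 0.8]]
posterior = update_posterior(prior, 0, transition, emission)
```

In this discrete case the posterior stays a fixed-size vector. The difficulty discussed later for nonlinear models is that no such compact exact representation is generally available.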

Hidden state based dynamic models have a wide range 
of applications, such as time series forecasting, finance, 
control, robotics, video and speech processing. Some 
detailed dynamic models and application examples can 
be found in (West & Harrison, 1997). 

From Eq. 2, it is clear that the benefit of using a
hidden state dynamic model is that the information
contained in the observations $X_{1:t}$ can be captured by
a relatively small hidden state $y_{t+1}$. Therefore, in
order to predict the future, we do not have to use all
previous observations $X_{1:t}$ but only their state
representation $y_{t+1}$. In principle, $y_{t+1}$ may
contain a finite history of length $k + 1$, such as
$x_t, x_{t-1}, \ldots, x_{t-k}$. Although the notation only
considers first order dependency, it incorporates higher
order dependency by considering a representation of the
form $Y_t = [y_t, y_{t-1}, \ldots, y_{t-k}]$, which is a
standard trick.

In an HMM or Kalman filter, both transition and
observation functions are linear maps. There are
reasonable algorithms that can learn these linear dynamic
models. For example, in addition to the classical EM
approach, it was recently shown that global learning
of certain hidden Markov models can be achieved in
polynomial time (Hsu et al., 2008). Moreover, for linear
models, the posterior update rule is quite simple.
Therefore, once the model parameters are estimated,
such models can be readily applied for prediction.

However in many real problems, the system dynamics 
cannot be approximated linearly. For such problems, 
it is often necessary to incorporate nonlinearity into 
the dynamic model. The standard approach to this 
problem is through nonlinear probability modeling, 
where prior knowledge is required to define a sensible 
state representation, together with parametric forms 
of transition and observation probabilities. The model 
parameters are learned by using probabilistic methods
such as EM (Wilson & Bobick, 1999; Roweis &
Ghahramani, 2001). When the learned model is applied
for prediction purposes, it is necessary to maintain
the posterior $P(y_t \mid X_{1:t})$ using the update
formula in Eq. 1. Unfortunately, for nonlinear systems,
maintaining $P(y_t \mid X_{1:t})$ is generally difficult
because the posterior can become exponentially more
complex (e.g., exponentially many mixture components in a
mixture model) as $t$ increases.

This computational difficulty is a significant obsta- 
cle to applying nonlinear dynamic systems to prac- 
tical problems. The traditional approach to address 
the computational difficulty is through approximation
methods. For example, in the particle filtering
approach (Gordon et al., 1993; Arulampalam et al.,
2002), one uses a finite number of samples to represent
the posterior distribution, and the samples are then
updated as observations arrive. Another approach is to
maintain a mixture of Gaussians to approximate the
posterior $P(y_t \mid X_{1:t})$, which may also be regarded
as a mixture of Kalman filters (Chen & Liu, 2000).
Although a number of mixture components exponential in
$t$ is needed to represent the posterior accurately, in
practice one has to use a fixed number of mixture
components to approximate the distribution.
This leads to the following question: even if the poste- 
rior can be well-approximated by a computationally 
tractable approximation family (such as finite mix- 
tures of Gaussians), how can one design a good ap- 
proximate inference method that is guaranteed to find 
a good quality approximation? The use of complex 
techniques required to design reasonable approxima- 
tion schemes makes it non-trivial to apply nonlinear 
dynamic models for many practical problems. 

This paper introduces an alternative approach, where
we start with a different representation of a linear
dynamic model which we call the sufficient posterior
representation. It is shown that one can recover
the underlying state representation by using prediction
methods that are not necessarily probabilistic.
This allows us to model nonlinear dynamic behaviors
with many available nonlinear supervised learning
algorithms such as neural networks, boosting, and
support vector machines in a simple and unified fashion.
Compared to the traditional approach, it has several
distinct advantages:

• It does not require us to design any explicit state 
representation and probability model using prior 
knowledge. Instead, the representation is implic- 
itly embedded in the representational choice of the 
underlying supervised learning algorithm, which 
may be regarded as a black box with the power 
to learn an arbitrary representation. The prior 
knowledge can be simply encoded as input fea- 
tures to the learning algorithms, which signifi- 
cantly simplifies the modeling aspect. 

• It does not require us to come up with any spe- 
cific representation of the posterior and the corre- 
sponding approximate Bayesian inference schemes 
for posterior updates. Instead, this issue is ad- 
dressed by incorporating the posterior update as 
part of the learning process. Again, the posterior 
representation is implicitly embedded in the rep- 
resentational choice of the underlying supervised 
learning algorithm. In this sense, our scheme
learns the optimal representation for posterior
approximation and the corresponding update rules
within the representational power of the underlying
supervised algorithm (see footnote 1).

• It is possible to obtain performance guarantees 
for our algorithm in terms of the learning per- 
formance of the underlying supervised algorithm. 
The performance of the latter has been heavily 
investigated in the statistical and learning the- 
ory literature. Such results can thus be applied 
to obtain theoretical results on our methods for 
learning nonlinear dynamic models. 

2. Sufficient Posterior Representation 

Instead of starting with a probability model, our
approach directly attacks the problem of predicting
$y_{t+k}$ based on $X_{1:t}$. Clearly the prediction
depends only on the posterior distribution
$P(y_{t+1} \mid X_{1:t})$. Therefore we can solve the
prediction problem as long as we can estimate and update
this posterior distribution.

1 Many modern supervised learning algorithms are
universal, in the sense that they can learn an arbitrary
representation in the large sample limit.

Figure 2. Dynamic model with observation vector $x_t$,
hidden state vector $y_t$, and the posterior sufficient
statistic vector $s_t$.

In our approach, it is assumed that the posterior
$P(y_{t+1} \mid X_{1:t})$ can be approximated by a family
of distributions parameterized by $s_{t+1} \in S$:
$P(y_{t+1} \mid X_{1:t}) \approx P(y_{t+1} \mid s_{t+1})$
for some deterministic parameter $s_{t+1}$ that depends on
$X_{1:t}$. That is, $s_{t+1}$ is a sufficient statistic for
the posterior $P(y_{t+1} \mid X_{1:t})$, and updating the
posterior is equivalent to updating the sufficient
statistic $s_{t+1}$. The augmented model that incorporates
the (approximate) sufficient statistics $s_t \in S$ is
shown in Fig. 2. In this model, $y_t$ can be integrated
out, which leaves a model containing only $s_t$ and $x_t$.

According to the posterior update of Eq. 1, there exists
a deterministic function $B$ such that:

$$s_{t+1} = B(x_t, s_t).$$

For simplicity, we can give an arbitrary value for the
initial state $s_1$, and let:

$$s_2 = A(x_1) = B(x_1, s_1).$$

Moreover, according to Eq. 2, given an arbitrary vector
function $f$ of the future events
$X_{t+1:\infty} = \{x_{t+1}, x_{t+2}, \ldots\}$, there exists
a deterministic function $C^f$ ($k > 0$) such that:

$$E_{X_{t+1:\infty}}\left[f(X_{t+1:\infty}) \mid X_{1:t}\right] = C^f(s_{t+1}).$$

Therefore the dynamics of the model in Fig. 1 are
determined by the posterior initialization rule $A$ and
the posterior update rule $B$. Moreover, the prediction of
the system is completely determined by the function $C^f$.

The key observation of our approach is that the
functions $A$, $B$, and $C^f$ are deterministic and do not
require any probability assumption, yet they fully
capture the correct dynamics of the underlying
probabilistic dynamic model. By removing the probability
assumption, we obtain a more general and flexible model.
In particular, we are not required to start with specific
forms of the transition model $P(y_{t+1} \mid y_t)$, the
observation model $P(x_t \mid y_t)$, or the posterior
sufficient statistic model
$P(y_{t+1} \mid X_{1:t}) \approx P(y_{t+1} \mid s_{t+1})$, as
required in the standard approach. Instead, we may
embed the forms of such models into the functional
approximation forms in standard learning algorithms,
such as neural networks, kernel machines, or tree
ensembles. These are universal learning machines that
are well studied in the learning theory literature.

Our approach essentially replaces a stochastic hidden
state representation through the actual state $Y$ by a
deterministic representation through the posterior
sufficient statistic $S$. Although the corresponding
representation may become more complex (which is why
in the traditional approach, $y_t$ is always explicitly
included in the model), this is not a problem in our
approach, because we do not have to know the explicit
representation. Instead, the complexity is incorporated
into the underlying learning algorithm, which allows us
to take advantage of sophisticated modern supervised
learning algorithms that can handle complex functional
representations. Moreover, unlike the traditional
approach, in which one designs a specific form of
$P(y_t \mid s_t)$ by hand and then derives an approximate
update rule $B$ by hand using Bayesian inference
methods, here we simply use learning to come up with
the best possible representation and update (assuming
the underlying learning algorithm is sufficiently
powerful). We believe this approach is also more robust
because it is less sensitive to model mis-specifications
or non-optimal approximate inference algorithms that
commonly occur in practice.

By changing the standard probabilistic dynamic model
in Fig. 1 to its sufficient posterior representation in
Fig. 2 (where we assume $y_t$ is integrated out, and thus
can be ignored), we can define the goal of our learning
problem. Since $y_t$ is removed from the formulation, in
the following, we shall refer to the sufficient posterior
statistic $s_t$ simply as the state.

We can now introduce the following definition of the
Sufficient Posterior Representation of a Dynamic Model,
which we refer to as SPR-DM.

Definition 2.1. (SPR-DM) A sufficient posterior
representation of a dynamic model is given by an
observed sequence $\{x_t\}$ and unobserved hidden state
$\{s_t\}$, characterized by state initialization map
$s_2 = A(x_1)$, state update map $s_{t+1} = B(x_t, s_t)$,
and state prediction maps:

$$E_{X_{t+1:\infty}}\left[f(X_{t+1:\infty}) \mid X_{1:t}\right] = C^f(s_{t+1})$$

for any pre-determined vector function $f$.

Our goal in this model is to learn the model dynamics
characterized by $A$ and $B$, as well as $C^f$ for any
given vector function $f$ of interest.

Figure 3. Left Panel: A state defining prediction. At training time, $x_1$ and $x_2$ are known. The essential goal is to
predict $x_2$ given $x_1$ using bottleneck hidden variables $s_2$. Two distinct mappings $A$ and $C$ are learned, with $s_2 = A(x_1)$.
Middle Panel: A state evolution prediction. At training time, $x_{t-1}$ and $s_{t-1}$ are used to predict $s_t$ via the operator
$B(x_{t-1}, s_{t-1})$ such that $x_t$ is reproduced via $C(s_t)$. Right Panel: A state projection prediction. At training time, $s_{t-2^j}$
is used to predict $s_t$ such that $x_t$ is reproduced via $C(s_t)$ for $j \in \{0, 1, 2, \ldots, \lfloor \log_2 T \rfloor\}$.



3. Learning SPR-DM 

The essential idea of our algorithm is to use a bot- 
tlenecking approach to construct an implicit definition 
of state, along with state space evolution and projec- 
tion operators to answer various natural questions we 
might pose. 

3.1. Training 

There are two parts to understanding the training pro- 
cess. The first is the architecture trained, and the 
second is the exact method of training this architec- 
ture. Note that our architecture is essentially func- 
tional rather than representational. 

3.1.1. Architecture 

Graphically, in order to recover the system dynamics, 
we solve two distinct kinds of prediction problems. To 
understand these graphs it is essential to understand 
that the arrows do not represent graphical models. In- 
stead, they are a depiction of which information is used 
to predict which other information. We distinguish 
observations and hidden "state" as double circles and 
circles respectively, to make clear what is observed and 
what is not. 

The first prediction problem solved in Fig. 3, left
panel, provides our initial definition of state.
Essentially, state is "that information which summarizes
the first observation in predicting the second
observation". Compared to a conventional dynamic model,
the quantity $s_2$ may be a sufficient statistic of the
state posterior after integrating $x_1$, the posterior
after integrating $x_1$ and evolving one step, or some
intermediate mixture. This ambiguity is fundamental, but
inessential.

The second prediction problem is state evolution,
shown in Fig. 3, middle panel. Here, we use a state
and an observation to predict the next state, reusing
the prediction of state from observation from the first
step. Note that even though there are two sources
of information for predicting $s_t$, only one prediction
problem (using both sources) is solved. Operator $B$
is what is used to integrate new information into the
state of an online system.

Without loss of generality, in the notation of Fig. 3 we
consider $f_0(X_{t+1:\infty}) = E[x_{t+1} \mid X_{1:t}]$, and
denote $C^{f_0}$ by $C$. An alternative interpretation of
$C$, which we do not distinguish in this paper, is to
learn the probability distribution over $x_{t+1}$. It
should be understood that our algorithm can be applied
with other choices of $f_0$.

The above two learning diagrams are used to obtain
the system dynamics ($A$ and $B$). One can then use
the learned system dynamics to learn prediction rules
$C^f$ with any function $f$ of interest. Here, we consider
the problem of predicting $x_{t+k}$ at different ranges
of $k = 2^j$. This gives a state projection operator
$D_j : s_t \to s_{t+2^j}$, without observing the future
sequence $x_{t+1}, x_{t+2}, \ldots$. The learning of state
projection is presented in Fig. 3, right panel. The idea
in state projection is that we want to build a predictor
of the observation far in the future. To do this, we chain
together several projection operators from the current
state. To make the system computationally more efficient,
we learn $\lfloor \log_2 T \rfloor$ operators, each
specialized to cover a different timespan. Note that state
evaluation provides an efficient way to learn $x_{t+k}$
based on $s_t$ simultaneously for multiple $k$ through
combination of projection operators. If computation is
not an issue, one may also learn $x_{t+k}$ based on $s_t$
separately for each $k$.

3.1.2. Method 

Training of $A$ is straightforward. Training of $C$ is
complicated by the fact that samples appear at multiple
timesteps, but otherwise straightforward given the
other components. To deal with multiple timesteps,
it is important for our correctness proof in Section 4.2
that the observation $x_t$ include the timestep $t$. The
training of $D$ is also straightforward given everything
else (and again, we require the timestep to be a part of
the update for the correctness proofs).

The most difficult thing to train is B, since an alter- 
ation to B can cascade over multiple timesteps. The 
method we chose takes advantage of both local and 
global information to provide a fast near-optimal so- 
lution. 

1. Initialization: Learn $B_t, C_t$ starting from
timestep $t = 1$ and conditioning on the previous
learned value. Multitask learning or initialization
with prior solutions may be applied to improve
convergence here. In our experiments, we initialize
$B_t, C_t$ to the average parameter values of previous
timesteps and use stochastic gradient descent
techniques for learning.

2. Conditional Training: Learn an alteration $B'$
which optimizes performance given that the existing
$B_t$ are used at every other time step. Since
computational performance is an issue, we use a
"backprop through time" gradient descent style
algorithm. For each timestep $t$, we compute the
change in squared loss for all future observations
using the chain rule, and update according to the
negative gradient.

3. Iteration: Update $B$ using stochastic mixing
according to $B_i = \alpha B' + (1 - \alpha) B_{i-1}$,
where $\alpha$ is the stochastic mixing parameter. The
precise method of stochastic mixing used in the
experiments is equivalent to applying the derivative
update with probability $\alpha$ and not updating with
probability $1 - \alpha$, which is a computational and
representational improvement over Searn (Daumé et al.,
2009).
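The randomized form of step 3 can be sketched as follows. This is only an illustration of "apply the derivative update with probability $\alpha$, otherwise leave the parameters unchanged". The function name `stochastic_mixing_step`, the flat parameter list, and the learning-rate argument are assumptions made for the sketch, and the gradients are stand-ins for those obtained in step 2.

```python
import random


def stochastic_mixing_step(params, grads, alpha, learning_rate, rng):
    """Apply the gradient update with probability alpha, else keep params."""
    if rng.random() < alpha:
        return [p - learning_rate * g for p, g in zip(params, grads)]
    return list(params)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
params = [1.0, -2.0]
grads = [2.0, 4.0]      # stand-in for backprop-through-time gradients
always = stochastic_mixing_step(params, grads, 1.0, 0.5, rng)  # alpha = 1
never = stochastic_mixing_step(params, grads, 0.0, 0.5, rng)   # alpha = 0
```

With `alpha` equal to 1 every proposed update is applied, and with `alpha` equal to 0 none is; intermediate values apply the update on a random fraction of steps.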

We prove (below) that the method in step (1) alone is
consistent. Steps (2) and (3) are used to force
convergence to a single $B$ and $C$ while retaining the
performance gained in step (1). The intuition behind step
(3) is that when $\alpha = o(1/T)$, with high probability
$B'$ is executed only once, implying that $B'$ need only
perform well with respect to the learning problem induced
by the rest of the system to improve the overall system.
This approach was first described in Conservative
Policy Iteration (Kakade & Langford, 2002).

3.2. Testing 

We imagine testing the algorithm by asking questions
like: what is the probability of observation $x_{t'}$ given
what is known up to time $t$, for $t' > t$? This is done by
using $A(x_1)$ to get $s_2$, then using $B(x_i, s_i)$ to
evolve the state to $s_t$. Then the time interval $t' - t$
is broken down into powers of 2, and the corresponding
state projection operators $D_j$ are applied to the state,
resulting in a prediction for $s_{t'}$. This is transformed
into a prediction for $x_{t'}$ using operator $C$.
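The test-time procedure can be sketched as below, treating the learned operators as opaque callables. The names `powers_of_two` and `predict`, the list-of-callables layout for the projection operators, and the lowest-exponent-first application order are assumptions for illustration. Here `gap` simply counts how many steps the filtered state is projected forward, so how it maps to the indices $t$ and $t'$ in the text is also an assumption.

```python
def powers_of_two(gap):
    """Return exponents j, lowest first, with sum(2**j) == gap."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    return [j for j in range(gap.bit_length()) if (gap >> j) & 1]


def predict(xs, gap, A, B, C, D):
    """Predict an observation `gap` steps beyond the filtered state.

    xs : observed sequence x_1, ..., x_t (non-empty)
    A  : state initialization, s_2 = A(x_1)
    B  : state update, s_{i+1} = B(x_i, s_i)
    C  : state-to-observation map
    D  : D[j] projects a state 2**j steps ahead
    """
    state = A(xs[0])
    for x in xs[1:]:          # integrate the remaining observations
        state = B(x, state)
    for j in powers_of_two(gap):  # chain projections covering the gap
        state = D[j](state)
    return C(state)


# Toy scalar operators (hypothetical, chosen so the result is easy to check).
A = lambda x: x
B = lambda x, s: s + x
C = lambda s: s
D = [lambda s, step=2 ** j: s + step for j in range(4)]
prediction = predict([1, 2, 3], 5, A, B, C, D)
```

Because the gap is decomposed in binary, at most about $\log_2$ of the gap projection operators are applied per query, which is the efficiency argument made for learning one operator per power of 2.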

4. Analysis 

4.1. Computation 

The computational requirements depend on the exact
training method used. For the initialization step,
training of $A$, $B_t$, and $C_t$ requires just $O(nT)$
examples. Training $D_j$ can be done with just
$O(nT \log_2 T)$ examples. For the iterative methods, an
extra factor of $T$ is generally required per iteration
for learning $B$.

4.2. Consistency 

We now show that under appropriate assumptions, the 
SPR-DM model can be learned in the infinite sample 
limit using our algorithm. Due to the space limitation, 
we only consider the non-agnostic situation, where the 
SPR-DM model is exact. That is, the function classes for
A, B, C used in our learning algorithm contain the correct
functions. The agnostic setting, where the SPR-DM 
model is only approximately correct, can be analyzed 
using perturbation techniques (e.g., for linear systems, 
this is done in (Hsu et al., 2008)). Although such anal- 
ysis is useful, the fundamental insight is identical to 
the non-agnostic analysis considered here. 

We consider the following constraints in the SPR-DM
model. We assume that the model is invertible: the
distribution over $x_t$ (more generally, the definition
can be extended to other vector functions
$\phi_0(x_t, \ldots)$) is a sufficient statistic for the
state $s_t$ that generates $x_t$. This is a nontrivial
limitation of state based dynamic models, but one that
retains the ability to capture long range dependencies.

Definition 4.1. (Invertible SPR-DM) The SPR-DM
in Definition 2.1 is invertible if there exists a function
$E$ such that for all $t$, $E(C(s_t)) = s_t$.

Invertibility is a natural assumption, but it's impor- 
tant to understand that invertible dynamic systems 
are a subset of dynamic systems as shown by the fol- 
lowing hidden Markov model example: 

Example 4.1 A hidden Markov model which is not
invertible: Suppose there are two observations, 0 and
1, where the first observation is uniform random, the
second given the first is always 0, and the third is the
same as the first. Under this setting, the two valid
sequences are 000 and 101. There is a hidden Markov
model which is not invertible that can express this
sequence. In particular, suppose state $s_1$ is $(0, 1)$ or
$(1, 1)$ and state $s_2$ is $(0, 2)$ or $(1, 2)$, with a
conditional observation that is $P(0 \mid *, 1) = 1$,
$P(0 \mid 0, 2) = 1$, and $P(0 \mid 1, 2) = 0$. However, no
invertible hidden Markov model can induce a distribution
over these sequences because the distribution on $x_2$ is
always 0, implying that a specification of state is
impossible due to lack of information.

Although Invertible SPR-DMs form a limited subset 
of SPR-DMs, they are still nontrivial as the following 
example shows. 

Example 4.2 An invertible hidden Markov model with
long range dependencies: Suppose there are two
observations, 0 and 1, and two states $s_1$ and $s_2$. Let
the first observation always be 0 and the first state be
uniform random, $P(s_1 \mid 0) = P(s_2 \mid 0) = 0.5$. Let the
states only self-transition according to
$P(s_1 \mid s_1) = 1$ and $P(s_2 \mid s_2) = 1$. Let the
observations be according to the following distribution:
$P(0 \mid s_1) = 0.75$, $P(0 \mid s_2) = 0.25$. Given only one
observation, the probability of state $s_1$ is 0.75 or
0.25 for observations 0 or 1 respectively. Given $T$
observations, the probability of state $s_1$ converges to
0 or 1 exponentially fast in $T$ using Bayes Law and the
Chernoff bound.
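The posterior in Example 4.2 can be checked numerically with a direct application of Bayes' rule. The sketch below assumes the observations passed in are the informative ones (those drawn with probability 0.75 or 0.25 of being 0), and the function name `posterior_s1` is hypothetical.

```python
def posterior_s1(observations, p0_given_s1=0.75, p0_given_s2=0.25, prior_s1=0.5):
    """P(state = s1 | observations) when states only self-transition."""
    weight_s1 = prior_s1
    weight_s2 = 1.0 - prior_s1
    for x in observations:
        # Multiply in the likelihood of each observation under each state.
        weight_s1 *= p0_given_s1 if x == 0 else 1.0 - p0_given_s1
        weight_s2 *= p0_given_s2 if x == 0 else 1.0 - p0_given_s2
    return weight_s1 / (weight_s1 + weight_s2)


one_zero = posterior_s1([0])        # 0.75, as stated in the example
one_one = posterior_s1([1])         # 0.25
ten_zeros = posterior_s1([0] * 10)  # already very close to 1
```

Each additional observation multiplies the odds between the two states by 3 or 1/3, which is why the posterior concentrates quickly as the number of observations grows.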

The above two examples illustrate the intuition behind
invertibility. One can extend the concept by
incorporating look aheads: that is, instead of taking $C$
as the probability of $x_t$ given $s_t$, we may let $C$ be
the probability of $X_{t:t+k}$ given $s_t$. This broadens
the class of invertible models. In this notation,
invertibility means that if two states $s_t$ and $s'_t$
induce the same short range behavior $X_{t:t+k}$, then they
are identical in the sense that they induce the same
behavior for all future observations $X_{t+1:\infty}$.
Generally speaking, non-invertible models are those that
cannot be efficiently learned by any algorithm because we
do not have sufficient information to recover states that
have different long range dynamics but identical behavior
in short ranges. In fact, there are well-known hardness
results for learning such models in the theoretical
analysis of hidden Markov models. There are no known
efficient methods to capture non-trivial long-range
effects. This implies that our restriction is not only
necessary, but also not a significant limitation in
comparison to any other known efficient learning
algorithms.

Next we prove that our algorithm can recover any in- 
vertible hidden Markov model given sufficiently pow- 
erful prediction with infinitely many samples. This is 
analogous to similar infinite-sample consistency results 
for supervised learning. 



Theorem 4.1. (Consistency) For all Invertible SPR-DMs,
if all prediction problems are solved perfectly, then for
all $i$, $p(x_i \mid x_1, \ldots, x_{i-1})$ is given by:

$$C(B(x_{i-1}, B(x_{i-2}, \ldots, A(x_1) \ldots))).$$

A similar theorem statement holds for projections. 

Proof. The proof is by induction. Throughout, hats denote
learned quantities and unhatted symbols the true model.

The base case is $C(A(x_1)) = \hat{C}_2(\hat{A}(x_1))$, which
holds under the assumption that the prediction problem
is solved perfectly. In the inductive case, define
$s_2 = A(x_1)$, $\hat{s}_2 = \hat{A}(x_1)$,
$s_i = B(x_{i-1}, s_{i-1})$,
$\hat{s}_i = \hat{B}_i(x_{i-1}, \hat{s}_{i-1})$, and assume
$C(s_i) = \hat{C}_i(\hat{s}_i)$. Invertibility and the
inductive assumption imply that there exists $E$ such that
$s_i = E(\hat{C}_i(\hat{s}_i))$. Consequently, there exist
$\hat{C}_{i+1} = C$ and
$\hat{B}_{i+1}(x_i, \hat{s}_i) = B(x_i, E(\hat{C}_i(\hat{s}_i)))$
such that:

$$C(B(x_i, s_i)) = \hat{C}_{i+1}(\hat{B}_{i+1}(x_i, \hat{s}_i)),$$

proving the inductive case. □

5. Experiments 

In this section we present experimental results on
two datasets that involve high-dimensional, highly
structured sequence data. The first dataset is motion
capture data from the CMU Graphics Lab Motion Capture
Database. The second dataset is the Weizmann dataset
(see footnote 2), which contains video sequences of
nine human subjects performing various actions.

5.1. Details of Training 

While the introduced framework allows us to use many
available nonlinear supervised learning algorithms, in
our experiments we use the following parametric forms
for our operators:

$$s_2 = A(x_1) = \sigma(A^\top x_1 + b),$$

$$s_t = B(x_{t-1}, s_{t-1}) = \sigma(B_1^\top x_{t-1} + B_2^\top s_{t-1} + b),$$

$$x_t = C(s_t) = C^\top s_t + a,$$

$$s_{t+2^j} = D_j(s_t) = D_j s_t + d,$$

where $\sigma(y) = 1/(1 + \exp(-y))$ is the logistic
function, applied componentwise, and
$\{C, B, A, D_j, a, b, d\}$ are the model parameters, with
$a$, $b$, and $d$ representing the bias terms.
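To make the shapes in these parametric forms concrete, here is a minimal pure-Python sketch. The helper names, the storage convention (a matrix is a list of rows), the split of the update weights into `B1` and `B2`, and the zero-valued demo weights are all assumptions for illustration, not the experimental setup.

```python
import math


def sigmoid(v):
    """Logistic function applied componentwise."""
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def matvec_t(W, v):
    """Compute W^T v, with W stored as len(v) rows of equal length."""
    return [sum(W[i][k] * v[i] for i in range(len(v))) for k in range(len(W[0]))]


def matvec(W, v):
    """Compute W v, with W stored as rows of length len(v)."""
    return [sum(w * z for w, z in zip(row, v)) for row in W]


def add(*vectors):
    """Elementwise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]


def init_state(A, b, x1):
    return sigmoid(add(matvec_t(A, x1), b))   # s_2 = sigma(A^T x_1 + b)


def update_state(B1, B2, b, x_prev, s_prev):
    # s_t = sigma(B1^T x_{t-1} + B2^T s_{t-1} + b)
    return sigmoid(add(matvec_t(B1, x_prev), matvec_t(B2, s_prev), b))


def emit(C, a, s):
    return add(matvec_t(C, s), a)             # x_t = C^T s_t + a


def project(Dj, d, s):
    return add(matvec(Dj, s), d)              # s_{t+2^j} = D_j s_t + d


# Demo with all-zero weights (hypothetical) so the outputs are easy to verify.
x_dim, s_dim = 3, 2
zeros = lambda rows, cols: [[0.0] * cols for _ in range(rows)]
A, B1, B2 = zeros(x_dim, s_dim), zeros(x_dim, s_dim), zeros(s_dim, s_dim)
C, D0 = zeros(s_dim, x_dim), zeros(s_dim, s_dim)
a, b, d = [1.0, 2.0, 3.0], [0.0, 0.0], [0.5, 0.5]

s2 = init_state(A, b, [0.1, 0.2, 0.3])             # sigmoid(0) everywhere
s3 = update_state(B1, B2, b, [0.4, 0.5, 0.6], s2)
x3_hat = emit(C, a, s3)                            # with C = 0 this is the bias a
s4 = project(D0, d, s3)                            # with D_0 = 0 this is the bias d
```

Note that the state passes through the logistic function in the initialization and update maps, while the observation and projection maps are affine.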

For both datasets, during the initialization step, the
values of $\{B_t, C_t\}$ are initialized to the average
parameter values of previous timesteps (see footnote 3).
Learning of $\{B_t, C_t\}$ then proceeds by minimizing the
squared loss using stochastic gradient descent. For each
time step, we use 500 parameter updates, with a learning
rate of 0.001. We then used 500 iterations of stochastic
mixing, using gradients obtained by backpropagation
through time. The stochastic mixing rate $\alpha$ was set
to 0.9 and was gradually annealed towards zero. We
experimented with various values for the learning rate
and various annealing schedules for the mixing rate
$\alpha$. Our results are fairly robust to variations in
these parameters. In all experiments we conditioned on
the two previous time steps to predict the next.

2 Available at
http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html.

3 The values of $A$, $B_1$, $C_1$ were initialized with
small random values sampled from a zero-mean normal
distribution with standard deviation of 0.01.

Figure 4. Left panel: compares the average squared test error as a function of prediction horizon for three models:
two linear autoregressive models when conditioning on 2 and 5 previous time steps, and the nonlinear model that uses a
20-dimensional hidden state. Right panel: compares nonlinear model with 20-state and 100-state HMM models. The
average predictor always predicts a vector of zeros.

5.2. Motion Capture Data 

The human motion capture data consists of sequences
of 3D joint angles plus body orientation and
translation. The dataset was preprocessed to be invariant
to isometries (Taylor et al., 2006), and contains various
walking styles, including normal, drunk, graceful,
gangly, sexy, dinosaur, chicken, and strong. We randomly
split the data into 30 training and 8 test sequences,
each of length 50. The training data was further split
at random into 25 training and 5 validation sequences.
Each time step was represented by a vector of 58
real-valued numbers. The dataset was also normalized to
have zero mean and was scaled by a single number, so
that the variance across each dimension was on average
equal to 1. The dimensionality of the hidden state was
set to 20.
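The normalization described here (per-dimension centering followed by one global rescaling) might look like the sketch below. It assumes population variance (dividing by the number of frames) and treats the data as one flat list of frames; the function names and the toy frames are hypothetical, and the code would divide by zero if every frame were identical.

```python
def normalize(frames):
    """Center each dimension, then divide everything by one global scale."""
    n, dim = len(frames), len(frames[0])
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    centered = [[frame[k] - means[k] for k in range(dim)] for frame in frames]
    # Population variance of each dimension after centering.
    variances = [sum(row[k] ** 2 for row in centered) / n for k in range(dim)]
    # A single scale for all dimensions, so relative magnitudes are preserved.
    scale = (sum(variances) / dim) ** 0.5
    return [[value / scale for value in row] for row in centered]


def mean_variance(frames):
    """Average over dimensions of the per-dimension population variance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(frame[k] for frame in frames) / n for k in range(dim)]
    return sum(
        sum((frame[k] - means[k]) ** 2 for frame in frames) / n for k in range(dim)
    ) / dim


frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]]  # toy data, not from the paper
normalized = normalize(frames)
```

Using one scale for all dimensions, rather than standardizing each dimension separately, keeps high-variance joints more influential in the squared loss than low-variance ones.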

Figure 4 shows the average test prediction errors using
squared loss, where the prediction horizon ranges
over 1, 2, 4, 8, 10, 16, and 25. The nonlinear model was
compared to two simple autoregressive linear models
that operate directly in the input space. The first
linear model, LINEAR-2, makes predictions $x_{t+k}$ via
the linear combination of the two previous time steps:

$$x_{t+k} = L^{1\top} x_t + L^{2\top} x_{t-1} + l. \tag{3}$$

The model parameters $\{L^1, L^2, l\}$ were fit by ridge
regression. The second model, LINEAR-5, makes predictions
by conditioning on the previous five time steps. We note
that the number of model parameters for these simple
autoregressive linear models grows linearly with the
input information. Hence when faced with high-dimensional
sequence data, learning linear operators directly in the
input space is unlikely to perform well.

It is interesting to observe that autoregressive linear 
models perform quite well in terms of making short- 
range predictions. This is probably due to the fact 
that locally, motion capture data is linear. However, 
the nonlinear model performs considerably better com- 
pared to both linear models when making long-range 
predictions. Figure 4 (right panel) further shows that
the proposed nonlinear model performs considerably
better than 20-state and 100-state HMMs. Both HMMs
use a Gaussian distribution as their observation model.
A simple HMM is evidently unable to cope with complex
nonlinear dynamics. Even a 100-state HMM is unable to
generalize.

5.3. Modeling Video 

Results on the motion capture dataset show that 
a nonlinear model can outperform linear and HMM 
models, when making long-range predictions. In this 
section we present results on the Weizmann dataset, 
which is considerably more difficult than the motion 
capture dataset. 

The Weizmann dataset contains video sequences of
nine human subjects performing various actions,
including waving one hand, waving two hands, jumping,
and bending. Each video sequence was preprocessed
by placing a bounding box around a person performing
an action. The dataset was then downsampled to
29 x 16 images, hence each time step was represented
by a vector of 464 real-valued numbers. We randomly
split the data into 36 training sequences (30 training
and 6 validation) and 10 test sequences, each of
length 50. The dataset was also normalized to have zero
mean and variance 1. The dimension of the hidden state
was set to 50.

Figure 5. Left panel: compares the average squared test error for three models: two linear autoregressive models, and
the nonlinear model that uses a 50-dimensional hidden state. Right panel: compares nonlinear model to 50-state and
100-state HMM models.

Figure 5 shows that the nonlinear model consistently 
outperforms both linear autoregressive and HMM 
models, particularly when making long-range predic- 
tions. It is interesting to observe that on this dataset, 
the nonlinear model outperforms the autoregressive 
model even when making short-range predictions. 

6. Conclusions 

In this paper we introduced a new approach to learning 
nonlinear dynamical systems and showed that it per- 
forms well on rather hard high-dimensional time series 
datasets compared to standard models such as HMMs 
or linear predictors. We believe that the presented 
framework opens up an entirely new set of devices for 
nonlinear dynamic modeling. It removes several ob- 
stacles in the traditional approach that requires heavy 
human design, and allows well-established supervised 
learning algorithms to be used automatically for
nonlinear dynamic models.